The following project will be analyzing data to be able to predict the profit percentage that the company StockX is proposed to make from the shoes that they resell. The data will be explored analytically to be able to understand the variance within and between variables. Through stratified sampling the data will be split and cross validation will be used as well to be able to fit into various machine learning models. After comparing models, the testing data will be fit to the model that produces the best RMSE and \(R^2\) value.
StockX is a online marketplace that mainly focuses on reselling sneakers. As a consumer, you are not only able to buy but also sell your own pair of sneakers.
StockX is a third party platform that primarily buys and sells shoes. For people who are not familiar with the sneaker industry, it can be helpful to think of the StockX business as such: think of a shoe as a stock where different factors affect the price of the shoe at a certain time. For stocks, certain socioeconomic factors such as investor demand influence the price of stocks. If the number of buy orders outnumber the number of sell orders, stock prices rise, and vice versa. The same holds true for shoes, where different factors such as innovation or the name/brand associated with the shoe influence how valuable a shoe is within the shoe market. Organizations such as Yeezy and Offwhite (brands used in this data set) implement basic supply and demand techniques by limiting scale production to increase demand and value for their shoes and other items.
Libraries Used for this Project
library(readr)
library(tidyverse)
library(tidymodels)
library(ISLR)
library(rpart.plot)
library(vip)
library(janitor)
library(xgboost)
library(ISLR2)
library(discrim)
library(poissonreg)
library(corrr)
library(corrplot)
library(klaR)
library(pROC)
library(glmnet)
library(dplyr)
library(randomForest)
library(rpart)
library(ranger)
library(vip)
library(lubridate)
library(kernlab)
library(kknn)
tidymodels_prefer()
Read the data file into R. This file was a csv file. I found this file on the StockX website that challenged its audience in 2019 to enter a data contest using the same data that is used for this project. The link to the original website can be found here.
sneakers <- read.csv("Data/StockX-Data.csv")
Next we will do some cleaning. After reading the file in I noticed that there were certain variables that were read in as characters, however they will need to be used as integers therefore we will be converting them to integers before we continue. Likewise the dates that are provided were read in as characters and we will have to switch them to be read as a date format.
# Clean names
sneakers <- sneakers %>%
clean_names()
#Sale.Price from chr to int
sneakers$sale_price <- parse_number(sneakers$sale_price)
sneakers$sale_price <- as.integer(sneakers$sale_price)
# Retail.Price from chr to int
sneakers$retail_price <- parse_number(sneakers$retail_price)
sneakers$retail_price <- as.integer(sneakers$retail_price)
# Order.Date from chr to date
sneakers$order_date <- as.Date(sneakers$order_date, "%m/%d/%y")
# Release.Date from chr to date
sneakers$release_date <- as.Date(sneakers$release_date, "%m/%d/%y")
#Check for Missing Values in the dataset
sum(is.na(sneakers))
## [1] 0
There are no missing values within the data set, therefore we will continue to use the entire data set.
There are a couple new variables that can be added to this data set that can further enhance the analysis. As of right now, the variables containing dates serve us no purpose however we can convert them to weeks relative to each other to be able to incorporate them into our analysis. Similarly, we can use the days of the week that each shoe was ordered to see if it influences the profit percentage in any way. Another very important variable that is missing in this data set is the profit that each shoe makes. This can easily be calculated by finding the difference between the sale price and the retail price of each shoe. Likewise the profit percentage can be calculated as well by dividing the profit by the retail price of each shoe.
# Add the number of weeks between the release date and the order date
sneakers$weeks <- round(difftime(sneakers$order_date, sneakers$release_date, units = "weeks"))
# Add in the day of the week that the shoe was ordered/purchased
sneakers$day_of_week <- format(as.Date(sneakers$order_date), "%A")
# Add how much profit was made on each sale
sneakers$profit <- sneakers$sale_price - sneakers$retail_price
# Add the percentage of the profit made
sneakers$profit_perc <- round((sneakers$profit / sneakers$retail_price) * 100)
The entire exploratory data analysis will be done on the entire data
set. The data set overall has 99,956 observations, with originally 8
variables, however after adding
new variables to the data set it actually has 12 variables. Each
observation represents the sale of a single shoe.
Shoe Size
sneakers %>%
count(shoe_size) %>%
arrange(-n) %>%
head()
## shoe_size n
## 1 10.0 11093
## 2 9.0 9706
## 3 11.0 9251
## 4 10.5 8784
## 5 9.5 8685
## 6 12.0 7297
There are 26 different shoe sizes in the dataset. The top three sizes across the entire data set are: * 10 (11093)
9 (9706)
11 (9251)
Brands
sneakers %>%
select(brand) %>%
distinct() %>%
group_by() %>%
head()
## # A tibble: 2 × 1
## brand
## <chr>
## 1 " Yeezy"
## 2 "Off-White"
There are only two brands within the 99,956 different observations found in this dataset: Yeezy and Off-White.
Next we will be separating the two brands into their own data frames to be able to analyze them separately.
Off-White
# Create a new data frame for all the Off-White brand sneakers
offwhite <- sneakers %>%
select(order_date, brand, sneaker_name, sale_price, retail_price, release_date, shoe_size, buyer_region, weeks, day_of_week, profit, profit_perc) %>%
filter(brand == "Off-White")
offwhite %>%
count(shoe_size) %>%
arrange(-n) %>%
head()
## shoe_size n
## 1 10.0 3654
## 2 11.0 3109
## 3 9.0 2703
## 4 10.5 2537
## 5 9.5 2255
## 6 12.0 2255
This Off-White dataset is missing the size 13.0 and 14.5, therefore there are only 24 sizes as opposed to 26. The top three sizes with the most shoes for Off-White are: * 10 (3654)
11 (3109)
9 (2703)
Yeezy
# Yeezy dataframe
# Clean brand names to not include spaces
sneakers$brand <- gsub(" ", "", sneakers$brand)
# Create a new data frame for all the Yeezy brand sneakers
yeezy <- sneakers %>%
select(order_date, brand, sneaker_name, sale_price, retail_price, release_date, shoe_size, buyer_region, weeks, day_of_week, profit, profit_perc) %>%
filter(brand == "Yeezy")
yeezy %>%
count(shoe_size) %>%
arrange(-n) %>%
head()
## shoe_size n
## 1 10.0 7439
## 2 9.0 7003
## 3 9.5 6430
## 4 10.5 6247
## 5 11.0 6142
## 6 12.0 5042
This Yeezy dataset is missing the size 15, therefore there are only 25 sizes as opposed to 26. The top three sizes with the most shoes for Yeezy are: * 10 (7439)
9 (7003)
9.5 (6430)
Within each brand there are different types of sneakers. Next we will look through each of the data frames for both brands to see how many individual sneakers there are.
#Find the number of sneakers listed for Off-White and the quantities of each.
offwhite %>%
count(sneaker_name) %>%
arrange(-n)%>%
head()
## sneaker_name n
## 1 Air-Jordan-1-Retro-High-Off-White-University-Blue 4635
## 2 Nike-Air-Presto-Off-White-Black-2018 1884
## 3 Nike-Air-Presto-Off-White-White-2018 1883
## 4 Nike-Air-VaporMax-Off-White-2018 1591
## 5 Nike-Blazer-Mid-Off-White-All-Hallows-Eve 1435
## 6 Nike-Blazer-Mid-Off-White-Grim-Reaper 1398
For the Off-White brand there are 30 individual sneaker types, where the sneakers with the most sales are: * Air-Jordan-1-Retro-High-Off-White-University-Blue (4635)
Nike-Air-Presto-Off-White-Back-2018 (1884)
Nike-Air-Presto-Off-White-White-2018 (1883)
#Find the number of sneakers listed for Yeezy and the quantities of each.
yeezy %>%
count(sneaker_name) %>%
arrange(-n) %>%
head()
## sneaker_name n
## 1 adidas-Yeezy-Boost-350-V2-Butter 11423
## 2 Adidas-Yeezy-Boost-350-V2-Beluga-2pt0 10395
## 3 Adidas-Yeezy-Boost-350-V2-Zebra 10110
## 4 Adidas-Yeezy-Boost-350-V2-Blue-Tint 9297
## 5 Adidas-Yeezy-Boost-350-V2-Cream-White 9097
## 6 Adidas-Yeezy-Boost-350-V2-Sesame 5553
For the Yeezy brand there are 20 individual sneaker types, where the sneakers with the most sales are: * Adidas-Yeezy-Boost-350-V2-Butter (11423)
Adidas-Yeezy-Boost-350-V2-Beluga-2pt0 (10395)
Adidas-Yeezy-Boost-350-V2-Zebra (10110)
In total there are 50 different kinds of sneakers across 2 different brands, 30 for Off-White, and 20 for Yeezy.
Next we will look at sales for both the Off-White brand and the Yeezy brand, and compare them to the average retail price, average sale price, and max sale price for each shoe and their kind.
# Find average sale and retail price and max sale price for each sneaker, arranged by highest average sale price to lowest
offwhite %>%
group_by(sneaker_name) %>%
summarize(avg_retail_price=mean(retail_price), avg_sale_price=mean(sale_price), max_sale_price=max(sale_price)) %>%
arrange(-avg_sale_price) %>%
head()
## # A tibble: 6 × 4
## sneaker_name avg_retail_price avg_sale_price max_sale_price
## <chr> <dbl> <dbl> <int>
## 1 Air-Jordan-1-Retro-High-Off-Wh… 190 1826. 2950
## 2 Air-Jordan-1-Retro-High-Off-Wh… 190 1770. 4050
## 3 Nike-Air-Presto-Off-White 160 1236. 2160
## 4 Nike-Air-Force-1-Low-Virgil-Ab… 150 976. 1500
## 5 Nike-Air-Max-97-Off-White-Elem… 190 894. 1290
## 6 Nike-Air-VaporMax-Off-White 250 857. 2399
Based on our output we conclude the follwoing:
The most expensive sales for Off-White shoes are: * Air-Jordan-1-Retro-High-Off-White-Chicago (4050)
Air-Jordan-1-Retro-High-Off-White-University-Blue (3680)
Air-Jordan-1-Retro-High-Off-White-White (2950)
The most expensive Off-White shoes on average for sales price are: * Air-Jordan-1-Retro-High-Off-White-White (1826.07)
Air-Jordan-1-Retro-High-Off-White-Chicago (1769)
Nike-Air-Presto-Off-White (1236.05)
# Find average sale and retail price and max sale price for each sneaker, arranged by highest average sale price to lowest
yeezy %>%
group_by(sneaker_name) %>%
summarize(avg_retail_price=mean(retail_price), avg_sale_price=mean(sale_price), max_sale_price=max(sale_price)) %>%
arrange(-avg_sale_price) %>%
head()
## # A tibble: 6 × 4
## sneaker_name avg_retail_price avg_sale_price max_sale_price
## <chr> <dbl> <dbl> <int>
## 1 Adidas-Yeezy-Boost-350-Low-Tur… 200 1532. 2300
## 2 Adidas-Yeezy-Boost-350-Low-Oxf… 200 1012. 1470
## 3 Adidas-Yeezy-Boost-350-Low-Moo… 200 997. 2000
## 4 Adidas-Yeezy-Boost-350-Low-Pir… 200 984. 1455
## 5 Adidas-Yeezy-Boost-350-V2-Core… 220 938. 1575
## 6 Adidas-Yeezy-Boost-350-Low-Pir… 200 895. 1300
The most expenisive sales for Yeezy shoes are: * Adidas-Yeezy-Boost-350-Low-Turtledove (2300)
Adidas-Yeezy-Boost-350-Low-Moonrock (2000)
Adidas-Yeezy-Boost-350-V2-Blue-Tint (2000)
The most expensive Yeezy shoes on average for sales price are: * Adidas-Yeezy-Boost-350-Low-Turtledove (1531.66)
Adidas-Yeezy-Boost-350-Low-Oxford-Tan (1011.51)
Adidas-Yeezy-Boost-350-Low-Moonrock (996.71)
Next we will find out how many buyer regions there are and what are they:
sneakers %>%
count(buyer_region) %>%
arrange(-n) %>%
head()
## buyer_region n
## 1 California 19349
## 2 New York 16525
## 3 Oregon 7681
## 4 Florida 6376
## 5 Texas 5876
## 6 New Jersey 4720
There are a total of 51 regions where the shoes in the data set was sold, where the top regions with the most sales are: * California (19349)
New York (16525)
Oregon (7681)
The regions for the most part are states, however District of Colombia is considered a region in this data set, which is not a state but rather the capital of the U.S.
In this next section we will look at the different variables and observe how they perform overall, in order to determine any form of correlation.
The following graph shows the number of shoes sold for each of the different sneaker types. It is evident that there are certain shoes that are more popular than others.
sneakers %>%
count(sneaker_name) %>%
mutate(sneaker_name = fct_reorder(sneaker_name, n)) %>%
ggplot(aes(sneaker_name, n)) +
geom_col(fill = "limegreen") +
labs(title = "Total Sales Per Sneaker", x = "Sneaker Name", y = "Number of Shoes") +
coord_flip()
This next graph shoes the profit that was made for each individual shoe. Here we can notice that the shoes with the most profit do not exactly match the most popular shoes, therefore we can conclude that some shoes run for higher prices, and are available in smaller quantities.
sneakers %>%
group_by(sneaker_name) %>%
summarise(avg_profit = round(mean(profit))) %>%
ggplot(aes(reorder(sneaker_name, avg_profit), avg_profit)) +
geom_col(fill = "limegreen") +
scale_y_continuous(labels = scales::dollar_format()) +
labs(title = "Average Profit Per Sneaker", x = "Sneaker", y = "Average Profit") +
coord_flip()
The nest few graphs will look into the statistics for the buyer regions, or in other words the states. We would like to see if the location of the buy has any importance in the way that it influences the sales of the shoes.
In this first graph, we examine the number of shoes sold in each state. It is clear that California has a high number of sales, with New York following.
sneakers %>%
group_by(buyer_region) %>%
count() %>%
ggplot(aes(reorder(buyer_region, n), n)) +
geom_col(fill = "limegreen") +
labs(title = "Total Number of Shoes Sold Per State", x = "State", y = "Number of Shoes") +
coord_flip()
This graph shows the total amount in sales that each state made. It is observed that the number of sales are correlated with the total amount of money made from sales.
sneakers %>%
group_by(buyer_region) %>%
summarise(total_purchased = sum(sale_price)) %>%
ggplot(aes(reorder(buyer_region, total_purchased), total_purchased)) +
geom_col(fill = "limegreen") +
scale_y_continuous(labels = scales::dollar_format()) +
labs(title = "Total Sales Per State", x = "State", y = "Total in Sales (Dollars)") +
coord_flip()
To further examine the influence of the buyer region we look at the
average sale price for a the shoes sold in each state. However, the
average sale price also has to do
with the types of shoes sold in each state too.
sneakers %>%
group_by(buyer_region) %>%
summarise(avg_purchased = mean(sale_price)) %>%
ggplot(aes(reorder(buyer_region, avg_purchased), avg_purchased)) +
geom_col(fill = "limegreen") +
scale_y_continuous(labels = scales::dollar_format()) +
labs(title = "Average Sale Price Per State", x = "State", y = "Average Shoe Price (Dollar)") +
coord_flip()
Likewise, we will also look at the average shoe size that each state buys. This is relevant because some sizes are more attainable than others, therefore we can get an idea of which states perhaps buy rarer shoes.
sneakers %>%
select(buyer_region, shoe_size) %>%
group_by(buyer_region) %>%
summarise(avg_shoe_size = mean(shoe_size)) %>%
ggplot(aes(reorder(buyer_region, avg_shoe_size), avg_shoe_size)) +
geom_point(color = "limegreen") +
labs(title = "Average Shoe Size by State", x = "State", y = "Average Shoe Size") +
coord_flip()
After looking at shoe size in states, we will examine the influence that shoe size has on the sales.
In the graph, we notice that there is a normal distribution for the sizes of the shoes which is expected.
sneakers %>%
group_by(shoe_size) %>%
count() %>%
ggplot(aes(shoe_size, n)) +
geom_point(color = "limegreen", size = 3) +
scale_x_continuous(breaks = 3:17) +
labs(title = "Total Number of Sales Per Shoe Size", x = "Shoe Size", y = "Number of Sales")
The following graph shows the shoe size with the profit percent that was made on each sale. The scatterplot is utilized to be able to identify shoes that are outliers that contribute to a higher percent profit.
sneakers %>%
select(shoe_size, profit_perc) %>%
group_by(shoe_size) %>%
ggplot(aes(x = shoe_size, y = profit_perc)) +
geom_point(color = "limegreen") +
scale_x_continuous(breaks = 3:17) +
scale_y_continuous(labels = function(x) paste0(x, "%")) +
labs(title = "Total Profit Percent Per Shoe Size", x = "Shoe Size", y = "Percent of Profit")
After looking at the percent profit, it is critical to look over the average sale price that each shoe size sells for. Because larger shoe sizes are harder to acquire it is reasonable to note that they sell for higher prices on average as shown in the graph.
sneakers %>%
select(shoe_size, sale_price) %>%
group_by(shoe_size) %>%
summarise(avg_sale_price = round(mean(sale_price))) %>%
ggplot(aes(x = shoe_size, y = avg_sale_price)) +
geom_col(fill = "limegreen") +
scale_x_continuous(breaks = 3:17) +
scale_y_continuous(labels = scales::dollar_format()) +
labs(title = "Average Sale Price Per Shoe Size", x = "Shoe Size", y ="Average Sale Price")
Lastly, we will analyze how the time of an order compared to the release date of each shoe can affect the sales of each shoe.
The first graph that we used was to see how the percent profit of the two brands across the weeks of their sale. It is also important to note that Off-White sales do not go past 75 weeks, while Yeezy sales do. Also it is seen that Off-White shoes tend to make a higher percentage of profit in comparison to Yeezy.
sneakers %>%
ggplot(aes(x = as.numeric(weeks), y = profit_perc, color = factor(brand))) +
geom_point() +
scale_colour_manual(values = c("#7CFC00", "#228B22")) +
scale_y_continuous(labels = function(x) paste0(x, "%")) +
labs(title = "Weeks when Shoes Sold for Most Profit", x = "Weeks", y = "Percent of Profit")
Another graph that was made was the following to see what weeks made the most sales in comparison to their release dates. Here it is clear that many sales were made close to the release date.
sneakers %>%
group_by(weeks) %>%
count() %>%
ggplot(aes(as.numeric(weeks), n)) +
geom_point(color = "limegreen", size = 3) +
labs(title = "Weeks when the Most Shoes Were Sold", x = "Weeks", y = "Number of Shoes")
Out of curiosity, a graph displaying which day of the week sold the most shoes was made. Based on the graph, Friday seems to be the most popular day to purchase shoes.
sneakers %>%
group_by(day_of_week) %>%
count() %>%
ggplot(aes(n, day_of_week)) +
geom_col(fill = "limegreen") +
labs(title = "Number of Shoes Sold Per Weekday", x = "Weekday", y = "Number of Shoes") +
scale_y_discrete(limits = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")) +
coord_flip()
Although Friday is the day that sells the most shoes, Wednesday is the day that makes the most profit, as seen in the graph.
sneakers %>%
group_by(day_of_week) %>%
summarise(avg_profit_perc = round(mean(profit_perc))) %>%
ggplot(aes(reorder(x = day_of_week, avg_profit_perc), y = avg_profit_perc)) +
geom_col(fill = "limegreen") +
labs(title = "Average Profit Percent per Weekday", x = "Weekday", y = "Average Percent of Profit") +
scale_x_discrete(limits = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")) +
coord_flip()
After exploring our data and understanding the correlation that the variables have with each other, it is time to split the data. However before we do that it is important to convert certain variables into factors so that we can later use them in our recipe.
#Convert variables to factors
sneakers$brand <- as.factor(sneakers$brand)
sneakers$sneaker_name <- as.factor(sneakers$sneaker_name)
sneakers$buyer_region <- as.factor(sneakers$buyer_region)
sneakers$day_of_week <- as.factor(sneakers$day_of_week)
sneakers$weeks <- as.numeric(sneakers$weeks)
Next we split the data and stratify it. When we stratify the data we make sure the data is being split evenly and between training and testing data sets. This makes sure that each set contains the same proportions of the response variable to reflect the entire data set overall. This helps the process of modeling because it allows the models to work with a smaller version of the overall data in order to acheive higher accuracy in predictions
Afterwards the split data will be assigned as training or testing. And before creating the recipe, I will fold my training data into 10 folds, with 2 repeats in order to be able to perform cross validation. Stratified cross validation will enhance the modeling process because it will further break down the training data set into smaller data sets that will be ran to create mini models, and essentially the best model will be used on the overall training data, and this too will help in determining the best model overall with the highest accuracy.
Originally, I had set up my cross validation repeats to equal 5, however, because I was working with a data set with a large amount of observations and many models took a long time to run, I decided that 2 repeats would be more appropriate.
set.seed(0714)
# Split the data
sneakers_split <- initial_split(sneakers, strata = profit_perc, prop = 0.7)
# Assign the training and testing data
sneakers_train <- training(sneakers_split)
sneakers_test <- testing(sneakers_split)
# V-fold cross validation
sneakers_fold <- vfold_cv(sneakers_train, strata = profit_perc, v = 10, repeats = 2)
Before creating our recipe, a correlation plot will be made in order to examine which variables are positively or negatively correlated with each other. This will provide a vague idea of which variables to use in our recipe to predict our response variable. For this data set I have decided to make the profit percentage the response variable, therefore I will note any variables that have high correlation to the profit percentage. As seen in the plot, sale price and, of course, profit are extremely correlated to the profit percentage, therefore they will not be included in the recipe in order to avoid multicollinearity.
sneakers_train2 <- sneakers_train[,sapply(sneakers_train, is.numeric)]
sneakers_train2 %>%
cor() %>%
corrplot(type = "lower", diag = FALSE,
method = 'color', addCoef.col = "Black")
Finally, we will create our recipe that will be predicting the profit percentage of the shoe sales, given the variables seen in the code below. Additionally, we will dummy code all the nominal predictors, as well as centering and scaling them.
sneakers_recipe <- recipe(
# predict the profit percent using all the predictors except, profit, release_date, order_date.
# weeks will account for the dates that are not being included
# will use the training data set for this recipe
profit_perc ~ brand + sneaker_name + retail_price + buyer_region + shoe_size + weeks + day_of_week, data = sneakers_train) %>%
# dummy code all the variables
step_dummy(all_nominal_predictors()) %>%
# scale and center all the predictor variables
step_normalize(all_predictors())
Now that we have our recipe made we can begin the modeling process. Because the response variable is a continuous quantitative variable, I will be making regression models. The models that will be used for this data set are: * Linear Regression Model
Because we are dealing with a regression problem, I began with a simple linear regression model. For this model I will not be using cross validation. First I create the model and the workflow, as well as add the recipe.
# Create the model
lm_model <- linear_reg() %>%
set_engine("lm")
# Create the workflow
lm_wf <- workflow() %>%
add_model(lm_model) %>%
add_recipe(sneakers_recipe)
Next I fit the training data to the model.
# Fit model to training set
lm_fit <- fit(lm_wf, sneakers_train)
Afterwards I make a table that will list the prediction values of the model along with the actual values.
# Table w/ predicted and actual values
sneakers_train_res <- predict(lm_fit, new_data = sneakers_train %>% select(-profit_perc))
sneakers_train_res <- bind_cols(sneakers_train_res, sneakers_train %>% select(profit_perc))
In order to understand the performance of this regression model, I outputted the RMSE, \(R^2\), and the MAE using the following code. The RMSE tells us how well the regression model can predict the value of the response variable in absolute terms. On the other hand, \(R^2\) tells us how well a model can predict the value of the response variable in percentage terms, or in this case in the form of a decimal. The MAE tells us how accurate our predictions are and, what the amount of deviation from the actual values is.
# Metric Set w/ RMSE, MSE, R^2
sneakers_metrics <- metric_set(rmse, rsq, mae)
sneakers_metrics(sneakers_train_res, truth = profit_perc,
estimate = .pred)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 55.0
## 2 rsq standard 0.868
## 3 mae standard 30.3
Finally, in order to visually see how the model performed, I graphed the predictions versus the actual responses. For this model, it seems like the model somewhat follows the direction of the line, however there are certain predictions that are either overestimated or underestimated.
# Plot for Predicted vs Actual Values
sneakers_train_res %>%
ggplot(aes(x = .pred, y = profit_perc)) +
geom_point(alpha = 0.2) +
geom_abline(lty = 2) +
theme_bw() +
coord_obs_pred()
Next we will be fitting the training data into Ridge and Lasso regression models.
Similar to linear regression, we will create the model, and workflow. However for these models, we will be tuning them, therefore we will use our folds that we created for cross-validation.
# Create a model
ridge_spec <-
linear_reg(penalty = tune(), mixture = 0) %>%
set_mode("regression") %>%
set_engine("glmnet")
# Create a workflow
ridge_wf <- workflow() %>%
add_recipe(sneakers_recipe) %>%
add_model(ridge_spec)
Next, the tuning grid will be set up that will determine the parameters for this model. The parameter tuned in this model will be penalty, while mixture will be set as 0 in order for it to be classified as a ridge regression model.
penalty_grid_r <- grid_regular(penalty(range = c(-5, 5)), levels = 50)
Finally, the model will be tuned, and cross-validation folds will be fit with the workflow as well.
tune_res_ridge <- tune_grid(
ridge_wf,
resamples = sneakers_fold,
grid = penalty_grid_r,
control = control_grid(verbose = TRUE)
)
Next, it is helpful to visualize the results, therefore autoplot will be used.
autoplot(tune_res_ridge)
In order to know which model performed the best I will collect the
metrics of all the models and choose the best.
collect_metrics(tune_res_ridge)
## # A tibble: 100 × 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.00001 rmse standard 56.5 20 0.455 Preprocessor1_Model01
## 2 0.00001 rsq standard 0.862 20 0.00159 Preprocessor1_Model01
## 3 0.0000160 rmse standard 56.5 20 0.455 Preprocessor1_Model02
## 4 0.0000160 rsq standard 0.862 20 0.00159 Preprocessor1_Model02
## 5 0.0000256 rmse standard 56.5 20 0.455 Preprocessor1_Model03
## 6 0.0000256 rsq standard 0.862 20 0.00159 Preprocessor1_Model03
## 7 0.0000409 rmse standard 56.5 20 0.455 Preprocessor1_Model04
## 8 0.0000409 rsq standard 0.862 20 0.00159 Preprocessor1_Model04
## 9 0.0000655 rmse standard 56.5 20 0.455 Preprocessor1_Model05
## 10 0.0000655 rsq standard 0.862 20 0.00159 Preprocessor1_Model05
## # … with 90 more rows
best_penalty_r <- select_best(tune_res_ridge, metric = "rsq")
Finally, I will use the best model to fit the overall training data into.
ridge_final <- finalize_workflow(ridge_wf, best_penalty_r)
ridge_final_fit <- fit(ridge_final, data = sneakers_train)
We will evaluate the performance that the model has on the training data. This evaluation will be used later on to compare the performance between all of the models.
augment(ridge_final_fit, new_data = sneakers_train) %>%
sneakers_metrics(truth = profit_perc, estimate = .pred)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 56.4
## 2 rsq standard 0.863
## 3 mae standard 31.4
This process will be similar to the same process as Ridge Regression.However in this model mixture will be set to 1, and the same parameter will be tuned, which is penalty. That is what classifies this model as a lasso regression model.
# Create the model
lasso_spec <-
linear_reg(penalty = tune(), mixture = 1) %>%
set_mode("regression") %>%
set_engine("glmnet")
# Create the workflow
lasso_wf <- workflow() %>%
add_recipe(sneakers_recipe) %>%
add_model(lasso_spec)
# Create the tuning grid
penalty_grid_l <- grid_regular(penalty(range = c(-2, 2)), levels = 50)
tune_res_lasso <- tune_grid(
lasso_wf,
resamples = sneakers_fold,
grid = penalty_grid_l,
control = control_grid(verbose = TRUE)
)
# Visualize results
autoplot(tune_res_lasso)
# See performance of all models
collect_metrics(tune_res_lasso)
## # A tibble: 100 × 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.01 rmse standard 55.3 20 0.439 Preprocessor1_Model01
## 2 0.01 rsq standard 0.867 20 0.00164 Preprocessor1_Model01
## 3 0.0121 rmse standard 55.3 20 0.439 Preprocessor1_Model02
## 4 0.0121 rsq standard 0.867 20 0.00164 Preprocessor1_Model02
## 5 0.0146 rmse standard 55.3 20 0.439 Preprocessor1_Model03
## 6 0.0146 rsq standard 0.867 20 0.00164 Preprocessor1_Model03
## 7 0.0176 rmse standard 55.3 20 0.440 Preprocessor1_Model04
## 8 0.0176 rsq standard 0.867 20 0.00164 Preprocessor1_Model04
## 9 0.0212 rmse standard 55.3 20 0.440 Preprocessor1_Model05
## 10 0.0212 rsq standard 0.867 20 0.00164 Preprocessor1_Model05
## # … with 90 more rows
# Choose best model
best_penalty_l <- select_best(tune_res_lasso, metric = "rsq")
# Fit best model to training data
lasso_final <- finalize_workflow(lasso_wf, best_penalty_l)
lasso_final_fit <- fit(lasso_final, data = sneakers_train)
# Evaluate performance of model on training data
augment(lasso_final_fit, new_data = sneakers_train) %>%
sneakers_metrics(truth = profit_perc, estimate = .pred)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 55.2
## 2 rsq standard 0.867
## 3 mae standard 30.4
Next tree-based models are going to be created with the folded data.
For this model the parameter that will be tuned is cost_complexity. The same first few steps from the previous models produced are executed for this model too.
# Create the model
reg_tree_spec <- decision_tree() %>%
set_engine("rpart") %>%
set_mode("regression")
# Create the workflow
reg_tree_wf <- workflow() %>%
add_model(reg_tree_spec %>% set_args(cost_complexity = tune())) %>%
add_recipe(sneakers_recipe)
# Create tuning grid
param_grid_dt <- grid_regular(cost_complexity(range = c(-4, -1)), levels = 10)
tune_res_dt <- tune_grid(
reg_tree_wf,
resamples = sneakers_fold,
grid = param_grid_dt,
control = control_grid(verbose = TRUE)
)
By using the autoplot function, it can be shown how the value of the cost complexity parameter affects the RMSE and \(R^2\).
# Shows which value of cost complexity produces best rmse and rsq
autoplot(tune_res_dt)
Again like previously seen, the best model will be chosen and fit to the
training data and the performance will be evaluated.
# See performance of all models
collect_metrics(tune_res_dt)%>%
arrange(std_err) %>%
head()
## # A tibble: 6 × 7
## cost_complexity .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.000464 rsq standard 0.947 20 0.000937 Preprocessor1_Model03
## 2 0.000215 rsq standard 0.958 20 0.000959 Preprocessor1_Model02
## 3 0.0001 rsq standard 0.966 20 0.000965 Preprocessor1_Model01
## 4 0.00464 rsq standard 0.862 20 0.00113 Preprocessor1_Model06
## 5 0.001 rsq standard 0.929 20 0.00126 Preprocessor1_Model04
## 6 0.00215 rsq standard 0.902 20 0.00143 Preprocessor1_Model05
# Choose best model
best_complexity <-select_best(tune_res_dt, metric = "rmse")
# Fit best model to training data
reg_tree_final <- finalize_workflow(reg_tree_wf, best_complexity)
reg_tree_final_fit <- fit(reg_tree_final, data = sneakers_train)
Next a visualization of the decision tree will be produced.
# Visualization of the tree
reg_tree_final_fit %>%
extract_fit_engine() %>%
rpart.plot(roundint = FALSE)
# Evaluate performance
augment(reg_tree_final_fit, new_data = sneakers_train) %>%
sneakers_metrics(truth = profit_perc, estimate = .pred)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 26.3
## 2 rsq standard 0.970
## 3 mae standard 14.8
We will repeat the same process for this model as was seen in the decision tree model. The difference between the random forest model and decision trees is that this model has three parameters that will be tuned: mtry, trees, and min_n. The hyperparameter mtry represents the number of predictors that will be randomly sampled at each split when creating the tree models. The hyperparameter trees represents the number of trees contained in the ensemble. The hyperparameter min_n represents the minimum number of data points in a node that are required for the node to be split further.
# Create the model
rf_spec <- rand_forest() %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("regression")
# Create the workflow
rf_wf <- workflow() %>%
add_model(rf_spec %>% set_args(mtry = tune(), trees = tune(), min_n = tune())) %>%
add_recipe(sneakers_recipe)
# Create tuning grid
param_grid_rf <- grid_regular(mtry(range = c(1, 7)), trees(range = c(50, 100)), min_n(range = c(1, 5)), levels = 5)
tune_res_rf <- tune_grid(
rf_wf,
resamples = sneakers_fold,
grid = param_grid_rf,
metrics = metric_set(rmse),
control = control_grid(verbose = TRUE)
)
The visualization of this model shows how the number of trees perform for each min_n and how it affects the RMSE.
autoplot(tune_res_rf)
Again like previously seen, the best model will be chosen and fit to the
training data and the performance will be evaluated.
# See performance of all models
collect_metrics(tune_res_rf) %>%
head()
## # A tibble: 6 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 50 1 rmse standard 126. 20 0.893 Preprocessor1_Model0…
## 2 2 50 1 rmse standard 90.9 20 0.613 Preprocessor1_Model0…
## 3 4 50 1 rmse standard 59.4 20 0.416 Preprocessor1_Model0…
## 4 5 50 1 rmse standard 51.2 20 0.466 Preprocessor1_Model0…
## 5 7 50 1 rmse standard 41.2 20 0.395 Preprocessor1_Model0…
## 6 1 62 1 rmse standard 125. 20 0.994 Preprocessor1_Model0…
# Choose best model
best_forest <-select_best(tune_res_rf, metric = "rmse")
# Fit best model to training data
rf_final <- finalize_workflow(rf_wf, best_forest)
rf_final_fit <- fit(rf_final, data = sneakers_train)
# Evaluate performance
augment(rf_final_fit, new_data = sneakers_train) %>%
sneakers_metrics(truth = profit_perc, estimate = .pred)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 38.1
## 2 rsq standard 0.942
## 3 mae standard 21.2
The last model was a model that I attempted to fit however I was not the most confident in running it because we did not get much practice with it in this course. For this model the parameter neighbors was set to be tuned.
# Create the model
knn_model <- nearest_neighbor(
neighbors = tune(),
mode = "regression") %>%
set_engine("kknn")
# Create the workflow
knn_wf <- workflow() %>%
add_model(knn_model) %>%
add_recipe(sneakers_recipe)
# Set-up tuning grid
knn_params <- parameters(knn_model)
# Define grid
knn_grid <- grid_regular(knn_params, levels = 2)
# Tune the grid
knn_tune <- knn_wf %>%
tune_grid(
resamples = sneakers_fold,
grid = knn_grid,
control = control_grid(verbose = TRUE)
)
The autoplot function was used to show how the number of neighbors affect the RMSE.
#Visualization of the # of neighbors
autoplot(knn_tune, metric = "rmse")
Next we choose the best model, fit our training data, and evaluate the
performance of the model on the training data set.
# See performance of all models
collect_metrics(knn_tune)
## # A tibble: 4 × 7
## neighbors .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 rmse standard 56.0 20 0.586 Preprocessor1_Model1
## 2 1 rsq standard 0.866 20 0.00224 Preprocessor1_Model1
## 3 15 rmse standard 51.7 20 0.478 Preprocessor1_Model2
## 4 15 rsq standard 0.885 20 0.00165 Preprocessor1_Model2
# Choose best model
best_knn <- select_best(knn_tune, metric = "rsq")
# Fit best to training data
knn_final <- finalize_workflow(knn_wf, best_knn)
knn_final_fit <- fit(knn_final, data = sneakers_train)
# Evaluate performance
augment(knn_final_fit, new_data = sneakers_train) %>%
sneakers_metrics(truth = profit_perc, estimate = .pred)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 39.1
## 2 rsq standard 0.935
## 3 mae standard 18.2
Next we will look at the performance of each model and compare to see which model performed best on the training data. For the linear regression model, it performed with the following metrics: * RMSE: 54.9713892
\(R^2\): 0.8682691
MAE: 30.2680065
For the ridge regression model, it performed with the following metrics: * RMSE: 56.4243079
\(R^2\): 0.8626641
MAE: 31.3720788
For the lasso regression model, it performed with the following metrics: * RMSE: 55.207143
\(R^2\): 0.867142
MAE: 30.440152
For the decision tree model, it performed with the following metrics: * RMSE: 26.3174108
\(R^2\): 0.9698074
MAE: 14.7527728
For the random forest mode, it performed with the following metrics: * RMSE: 38.1130937
\(R^2\): 0.9420715
MAE: 21.1902408
For the KNN model, it performed with the following metrics: * RMSE: 39.0606262
\(R^2\): 0.9350643
MAE: 18.2064622
Based on the results the models that performed, ordered in best performing to least are: * Decision Tree
Random Forest
KNN
Linear Regression
Lasso regression
Ridge Regression
Therefore based on these results, I will proceed to fit my testing data with the decision tree model.
I will fit the testing data with the best performing model.
# Fit testing data
rf_final_fit_test <- fit(rf_final, data = sneakers_test)
After looking at the matrix, we can see how the model actually performed by outputting the RMSE, \(R^2\), and MAE.
# Evaluate Performance for Testing Data
augment(rf_final_fit_test, new_data = sneakers_test) %>%
sneakers_metrics(truth = profit_perc, estimate = .pred)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 39.6
## 2 rsq standard 0.937
## 3 mae standard 22.2
Based on the outcome, the decision tree model performed with an \(R^2\) of 0.93, which means that 93% percent of the data was fitted accurately to the model predictions. Overall, this project has analyzed each of the variables in the data set and used them to predict the profit percentage that the company StockX will make off the sale of a shoe. For this data set, the decision tree model was the machine learning model with the best performance.